Experiments in Sentence Language Identification with Groups of Similar Languages
Abstract
Language identification is a simple problem that becomes much more difficult when its usual assumptions are broken. In this paper we consider the task of classifying short segments of text in closely related languages for the Discriminating Similar Languages shared task, which is broken into six subtasks: (A) Bosnian, Croatian, and Serbian; (B) Indonesian and Malay; (C) Czech and Slovak; (D) Brazilian and European Portuguese; (E) Argentinian and Peninsular Spanish; and (F) American and British English. We consider a number of different methods to boost classification performance, such as feature selection and data filtering, but we ultimately find that a simple naïve Bayes classifier using character and word n-gram features is a strong baseline that is difficult to improve on, achieving an average accuracy of 0.8746 across the six tasks.
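As a rough illustration of the kind of baseline the abstract describes, the sketch below trains a multinomial naïve Bayes classifier on character and word n-gram counts. It is not the authors' implementation: the n-gram orders, the add-one smoothing, the `extract_features` and `NaiveBayesLID` names, and the four toy training sentences are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, char_orders=(1, 2, 3), word_orders=(1,)):
    """Return character and word n-gram features for one sentence.

    The n-gram orders are illustrative defaults, not values from the paper.
    """
    text = text.lower()
    padded = f" {text} "  # pad so that word boundaries become features
    feats = []
    for n in char_orders:
        feats.extend("c:" + padded[i:i + n] for i in range(len(padded) - n + 1))
    words = text.split()
    for n in word_orders:
        feats.extend("w:" + " ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return feats


class NaiveBayesLID:
    """Multinomial naive Bayes with add-alpha smoothing (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.doc_counts = Counter()                 # label -> number of sentences
        self.vocab = set()

    def fit(self, sentences, labels):
        for text, label in zip(sentences, labels):
            feats = extract_features(text)
            self.feature_counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = extract_features(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, counts in self.feature_counts.items():
            total = sum(counts.values())
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for feat in feats:
                score += math.log(
                    (counts[feat] + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for subtask (F); these sentences are made up for the example.
train_sentences = [
    "the color of the theater",
    "my favorite neighbor organized the program",
    "the colour of the theatre",
    "my favourite neighbour organised the programme",
]
train_labels = ["en-US", "en-US", "en-GB", "en-GB"]

model = NaiveBayesLID().fit(train_sentences, train_labels)
print(model.predict("what colour is the theatre"))
print(model.predict("a color theater"))
```

Character n-grams capture spelling patterns such as "our" versus "or", while word n-grams capture whole lexical items; combining both in one feature set is the idea the abstract refers to.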
Publication date: 2014